The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
Abstract
FLASHMM is a package for single-cell differential expression analysis using linear mixed-effects models (LMMs). Mixed-effects models have become a powerful tool in single-cell studies due to their ability to model intra-subject correlation and inter-subject variability. However, classic LMM estimation methods face limitations in computational speed and memory usage when applied to large-scale single-cell transcriptomics data. The FLASHMM package provides a fast and scalable approach to address scalability and memory efficiency in LMM fitting. This vignette describes the methods for LMM estimation and hypothesis testing, as well as the R functions for implementing these methods, and demonstrates the use of the package through an example.Consider the linear mixed-effects model (LMM) as expressed below (Searle, Casella, and McCulloch 2006) \[\begin{equation}\label{eq:lmm} y = X\beta + Zb + \epsilon, \end{equation}\] where \(y\) is an \(n\times 1\) vector of observed responses, \(X\) is an \(n\times p\) design matrix for fixed effects \(\beta\), \(Z\) is an \(n\times q\) design matrix for random effects \(b\), and \(\epsilon\) is an \(n\times 1\) vector of residual errors. The term of random effects may be a combination of various components, \[ Zb = Z_1 b_1 + \cdots + Z_K b_K, \] where \(Z=[Z_1,\ldots,Z_K]\), \(b=[b^T_1,\ldots,b^T_K]^T\), \(K\) is the number of random-effect components, and \(Z_k\) is an \(n\times q_k\) design matrix for the \(k\)-th random-effect component. The superscript \(T\) denotes a transpose of vector or matrix. The basic assumptions are as follows:
Here \(\mathbf{0}\) is a vector or matrix of zero elements, \(I_n\) is an \(n\times n\) identity matrix, and \(\sigma^2_k\) and \(\sigma^2\) are unknown parameters, called variance components.
Hartley and Rao (1967) developed the maximum likelihood (ML) method for estimating the LMM parameters (fixed effects and variance components). Patterson and Thompson (1971) proposed a modified maximum likelihood procedure which partitions the data into two mutually uncorrelated parts, one being free of the fixed effects used for estimating variance components, called restricted maximum likelihood (REML) method. The REML estimator is unbiased. The ML estimator of variance components is biased in general. Both methods are asymptotically identical for estimating variance components. With variance components, \(\theta\), estimated, the fixed effects estimated by either ML or REML are given as follows \[ \hat\beta = (X^TV_{\theta}^{-1}X )^{-1}X^TV_{\theta}^{-1}y, \] with covariance matrix \[var(\hat\beta) = (X^TV_{\theta}^{-1}X)^{-1},\] where \(\theta = (\theta_0, \theta_1,\ldots, \theta_K)\), \(\theta_0 = \sigma^2\), \(\theta_k = \sigma^2_k\), and \[V_{\theta} = \theta_0 I_n + \theta_1 Z_1Z_1^T + \ldots + \theta_K Z_KZ_K^T.\]
Estimating the variance components by either ML or REML is a numerical optimization problem. Various iterative methods based on the log likelihood, called gradient methods, have been proposed (Searle, Casella, and McCulloch 2006). The gradient methods are represented by the iteration equation \[\begin{equation}\label{eq:gradient} \theta^{(i+1)} = \theta^{(i)} + \Gamma(\theta^{(i)})\frac{\partial l(\theta^{(i)})}{\partial\theta}, \end{equation}\] where \(\partial l(\theta)/\partial\theta\) is the gradient of the log likelihood function, and \(\Gamma(\theta)\) is a modifier matrix of the gradient direction, which can be specified by Newton–Raphson, Fisher scoring or average information. The Newton-Raphson method uses the inverse Hessian matrix as a modifier, while the Fisher scoring method uses the inverse of the expectation of the Hessian matrix. The average information method uses the inverse of the average Hessian matrix and its expectation. We use the Fisher scoring method. The Fisher scoring method is more stable than others because the Hessian matrix may not be positive definite but its expectation is positive definite.
The hypotheses for testing fixed effects and variance components can be respectively defined as \[ H_{0, i}: \beta_i = 0 ~~\text{versus}~~H_{1,i}: \beta_i\ne 0, \] \[ H_{0, k}: \theta_k \leq 0 ~~\text{versus}~~H_{1,k}: \theta_k > 0, \] where \(\theta_k\), \(k=1, \ldots, K\), represent the parameters of variance components \(\sigma^2_k\) but are allowed to be negative. The lower boundary of the parameters, \(\theta\), can be the negative value such that the variance-covariance matrix, \(V_{\theta}\), remains definable (positive definite). The negative lower boundary exists and can be less than \(- \sigma^2/\lambda_{max}\), where \(\lambda_{max} > 0\) is the largest singular value of \(ZZ^T\). If \(\theta_k > 0\), then \(\sigma_k^2 = \theta_k\) is definable and the mixed model is well-specified. Otherwise, if \(\theta_k \le 0\), the mixed model is miss-specified and the \(k\)-th random-effect component shouldn’t be included.
Allowing the parameters of variance components to take negative value avoids the zero boundary at the null hypothesis for the variance components. Consequently, the asymptotic normality of the maximum likelihood estimation at the null hypothesis holds under regularity conditions, which enables us to use z-statistics or t-statistics for hypothesis testing of the fixed effects and variance components. The t-statistics for fixed effects are given by \[\begin{equation} \label{eq:tcoef} T_i = \frac{\hat\beta_i}{\sqrt{var(\hat\beta_i)}} = \frac{\hat\beta_i}{\sqrt{var(\hat\beta)_{ii}}} ~\sim ~t(n - p). \end{equation}\] The t-statistic for a contrast, a linear combination of the estimated fixed effects, \(c^T\hat\beta\), is \[\begin{equation} \label{eq:tcontrast} T_c = \frac{c^T\hat\beta}{\sqrt{c^Tvar(\hat\beta) c}} \sim t(n-p). \end{equation}\] The test statistics for the parameters of variance components are given by \[\begin{equation} \label{eq:zvarcomp} T_{\theta_k} = \frac{\hat\theta_k}{\sqrt{[I(\hat\theta)^{-1}]_{kk}}} \sim N(0, 1), \end{equation}\] where \(I(\theta)\) is the Fisher information matrix.
We developed a summary statistics based algorithm for implementing the gradient method \(\eqref{eq:gradient}\). Let \(XX\), \(XY\), \(ZX\), \(ZY\), and \(ZZ\) denote a matrix, respectively, which define the summary statistics that are computed from cell-level data \(X\), \(Y\) and \(Z\) as follows: \[\begin{equation} \label{eq:sdata} \begin{array}{l} XX = X^TX, ~XY = X^TY^T,\\ ZX = Z^TX, ~ZY = Z^TY^T, ~ZZ = Z^TZ,\\ Ynorm = [y_1y_1^T, \ldots, y_my_m^T], \end{array} \end{equation}\] where \(Y = [y_1^T, \ldots, y_m^T]^T\) is a \(m\)-by-\(n\) matrix of gene expression profile with each row \(y_i\) corresponding to the expression of gene \(i\), \(i=1,\ldots,m\), \(m\) is the number of genes and \(n\) is the number of cells (sample size). The summary statistics can be precomputed and stored. The summary statistics based algorithm has a complexity of \(O(m(p^3 + q^3))\), which doesn’t depend on number of cells (sample size \(n\)). In single-cell differential expression (DE) analysis, the numbers of fixed and random effects, \(p\) and \(q\), are relatively small. Therefore, the algorithm is fast and scalable, and requires less computer memory.
FLASHMM package provides two functions, lmm and lmmfit, for fitting LMM. The lmm function uses summary statistics as arguments. The lmmfit function is a wrapper function of lmm, which directly uses cell-level data and computes the summary statistics inside the function. The lmmfit function is simple to be operated but it has a limitation of memory use. For extremely large scale data, we can precompute the summary-level data by \(\eqref{eq:sdata}\), and then use lmm function to fit LMM. FLASHMM provides lmmtest function to perform statistical test on the fixed effects and the contrasts of the fixed effects.
In summary, FLASHMM package provides the following main functions.
We use a simulated multi-sample multi-cell-type single-cell RNA-seq (scRNA-seq) dataset to illustrate how to utilize FLASHMM to perform single-cell differential expression analysis. In this example, we are interested in identifying the genes differentially expressed between two treatments (conditions) within a cell type.
We simulate the multi-sample multi-cell-type scRNA-seq data by simuRNAseq function in FLASHMM package. The simuRNAseq function can generate the scRNA-seq data, with or without a reference dataset. The reference dataset contains count data and metadata. The count data is a genes-by-cells matrix. The metadata must be a data frame containing three columns: samples (subjects), cell-types (clusters), and treatments (experimental conditions), and the three columns must be named ‘sam’, ‘cls’, and ‘trt’, respectively.
For simplicity and demonstration purposes, we use a simulated reference dataset to generate the scRNA-seq data. First we simulate the reference dataset.
##Generate a reference dataset by simuRNAseq function.
set.seed(2502)
refdata <- simuRNAseq(nGenes = 2000, nCells = 10000)
##counts
counts <- refdata$counts
##metadata
metadata <- refdata$metadata
head(metadata)
#> sam cls trt libsize
#> Cell1 A1 10 A 3783
#> Cell2 A3 10 A 3906
#> Cell3 B12 10 B 3956
#> Cell4 A7 3 A 3867
#> Cell5 B5 3 B 3974
#> Cell6 B13 4 B 3886
rm(refdata)
For the simulated reference dataset, we don’t need to change the metadata column names because the columns of samples, clusters, and treatments, are already named ‘sam’, ‘cls’, and ‘trt’, respectively. But, for a real biological reference dataset, we need to change the metadata column names.
Next we use the reference counts and metadata to simulate scRNA-seq data that contains 100 genes and 100,000 cells from 25 samples (subjects) and 10 clusters (cell-types) with 2 treatments. There are 10 differentially expressed genes.
##Generate the scRNA-seq data by simuRNAseq function.
set.seed(2503)
dat <- simuRNAseq(counts, metadata = metadata, nGenes = 100, nCells = 100000, nsam = 25,
ncls = 10, ntrt = 2, nDEgenes = 10, minbeta = 0.5, maxbeta = 1, var.randomeffects = 0.1)
str(dat)
#> List of 5
#> $ ref.mean.dispersion:'data.frame': 2000 obs. of 2 variables:
#> ..$ mu : num [1:2000] 3.7 3.192 0.963 4.519 1.927 ...
#> ..$ dispersion: num [1:2000] 1.119 0.841 0.861 0.905 1.667 ...
#> $ metadata :'data.frame': 100000 obs. of 4 variables:
#> ..$ sam : chr [1:100000] "A4" "B15" "B13" "B1" ...
#> ..$ cls : chr [1:100000] "4" "6" "3" "2" ...
#> ..$ trt : chr [1:100000] "A" "B" "B" "B" ...
#> ..$ libsize: num [1:100000] 3871 4223 4241 4189 4253 ...
#> $ counts : num [1:100, 1:100000] 1 10 2 0 0 2 1 1 5 0 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:100] "Gene1284" "Gene724" "Gene774" "Gene1640" ...
#> .. ..$ : chr [1:100000] "Cell7546" "Cell9224" "Cell3950" "Cell530" ...
#> $ DEgenes :'data.frame': 10 obs. of 3 variables:
#> ..$ gene : chr [1:10] "Gene1537" "Gene1225" "Gene864" "Gene1546" ...
#> ..$ beta : num [1:10] -0.776 0.977 0.819 -0.532 0.896 ...
#> ..$ cluster: chr [1:10] "1" "1" "10" "2" ...
#> $ treatment : chr "B"
rm(counts, metadata)
The simulated data contains a genes-by-cells matrix of expression counts and a data frame of metadata consisting of 3 columns: samples (sam), cell-types (cls) and treatments (trt).
We perform differential expression analysis of the simulated dataset using FLASHMM package. We are to identify the significantly differentially expressed genes between two treatments in a cell-type. The analyses involve following steps: LMM design, LMM fitting and hypothesis testing, exploring LMM fit and the differetially expressed (DE) genes.
LMM design: construct design matrices for fixed and random effects described as \(\eqref{eq:lmm}\), and compute the gene expression profile. The gene expression is taken as log-transformed count matrix, \[Y = \log(1+\text{counts}),\] in which each row corresponds to the expression profile for a gene. The design matrix for the fixed effects can be created by the function model.matrix, as follows \[ X = model.matrix(\sim 0 + log(library.size) + cell.type + cell.type:treatment), \] where the interaction term \(cell.type:treatment\) represents the treatment effect in a specific cell-type. \(library.size\) is a normalization factor of the scRNA-seq counts, calculated by the total sum of counts across all genes for each cell. We consider the subjects (samples) as random effects that reflect the intra-subject correlation and inter-subject variability. The design matrix for the random effects is given by \[ Z = model.matrix(\sim 0 + subject). \]
LMM fitting and hypothesis testing: We use either lmm or lmmfit function to fit LMMs for all genes. With the cell-level data matrices \(Y\), \(X\) and \(Z\), the LMMs can be fit by \[lmmfit(Y, X, Z, d = d),\] where \(d\) is the number of random-effects. For \(K\) components of random-effects, \(d = (q_1, \ldots, q_K)\), where \(q_k\) is the number of columns of the design matrix \(Z_k\) for the \(k\)-th random-effect component. For a large-scale data, the lmmfit function has a problem of computer memory limit. In this case, it is recommended to pre-compute and store the summary-level data: \(XX\), \(XY\), \(ZX\), \(ZY\), \(ZZ\), and \(Y_{norm}\), defined in \(\eqref{eq:sdata}\), and then use the lmm function to fit the LMMs as follows: \[lmm(XX, XY, ZX, ZY, ZZ, Ynorm = Ynorm, n = n, d = d).\] The summary-level data doesn’t depend on the sample size \(n\). This makes lmm memory-efficient. The default method in the lmm and lmmfit functions is restricted maximum likelihood (REML), that is, method = ‘REML’. If you use the maximum likelihood (ML) method to fit the LMM, set method = ‘ML’ in the lmm and lmmfit functions.
In addition, the lmm and lmmfit functions perform hypothesis testing on the fixed effects (coefficients). We can also use the lmmtest function to conduct statistical tests on both the fixed effects and their contrasts for various comparisons between different levels.
LMM fitting returns a list of estimates of LMM parameters (coefficients and variance components), standard errors of the estimates, covariance matrix of the coefficients (fixed effects), and t-values and p-values for the hypothesis testing on the coefficients, for each gene. The LMM fitting also returns the number of iterations, the first partial derivatives of the log likelihood, and the log-likelihood at the termination of the iterations for each gene.
Exploring LMM fit and DE genes: We check if the LMM fitting is convergent, and perform hypothesis testing on the variance components of random effects. Then we identify DE genes based on hypothesis testing p-values for the coefficients (fixed effects). If the absolute first partial derivatives of log likelihood are all less than the convergence tolerance, the LMM fitting converges, otherwise it doesn’t converge. The genes for which LMM fitting doesn’t converge should be excluded in the subsequent analysis because the estimated coefficients for these genes are not reliable. DE genes can be identified by adjusting p-values using the false discovery rate (FDR) or family-wise error rate (Bonferroni correction). Additionally, we may exclude genes with small coefficients (effect size or log-fold change).
##Gene expression matrix, Y = log2(1 + counts)
Y <- log2(1 + dat$counts)
dat$counts <- NULL #Remove the counts.
##Design matrix for fixed effects
X <- model.matrix(~ 0 + log(libsize) + cls + cls:trt, data = dat$meta)
##Design matrix for random effects
Z <- model.matrix(~ 0 + as.factor(sam), data = dat$meta)
##Dimension of random effects
d <- ncol(Z)
rm(dat)
##Option 1: fit LMM by cell-level data.
max.iter <- 100
epsilon <- 1e-5
fit <- lmmfit(Y, X, Z, d = d, max.iter = max.iter, epsilon = epsilon)
##Option 2: fit LMM by summary-level data.
##(1) Compute the summary-level data.
n <- nrow(X)
XX <- t(X)%*%X
XY <- t(Y%*%X)
ZX <- t(Z)%*%X
ZY <- t(Y%*%Z)
ZZ <- t(Z)%*%Z
Ynorm <- rowSums(Y*Y)
#rm(X, Y, Z) #release the memory.
##(2) Fit LMM.
fitss <- lmm(XX, XY, ZX, ZY, ZZ, Ynorm = Ynorm, n = n, d = d,
max.iter = max.iter, epsilon = epsilon)
identical(fit, fitss) #The two LMM fits are identical.
#> [1] TRUE
rm(fitss)
##Testing coefficients (fixed effects)
test <- lmmtest(fit)
#head(test)
##The testing t-value and p-values are also provided in the LMM fit.
range(test - cbind(t(fit$coef), t(fit$t), t(fit$p))) #identical
#> [1] 0 0
fit$t[, 1:4]
#> Gene1284 Gene724 Gene774 Gene1640
#> log(libsize) 9.9759533 8.0716046 8.464515 11.733608189
#> cls1 -8.5235110 -6.2329553 -6.832948 -9.380086777
#> cls10 -8.5255532 -6.2200933 -6.837642 -9.419146247
#> cls2 -8.5217821 -6.2340030 -6.856771 -9.438549994
#> cls3 -8.5112809 -6.2549674 -6.837254 -9.413566707
#> cls4 -8.4927904 -6.2509868 -6.839934 -9.465023927
#> cls5 -8.4831034 -6.2498292 -6.826692 -9.420796030
#> cls6 -8.5360323 -6.2482795 -6.807689 -9.406666370
#> cls7 -8.5148922 -6.2413472 -6.824715 -9.423283104
#> cls8 -8.5167202 -6.2280231 -6.798070 -9.474213289
#> cls9 -8.5215811 -6.2299695 -6.826893 -9.421405553
#> cls1:trtB -0.5179175 -1.1777057 -2.705067 -0.049377689
#> cls10:trtB -0.9129030 -1.3374074 -2.475557 0.104072809
#> cls2:trtB -0.7909786 -1.3742843 -2.449153 0.090249259
#> cls3:trtB -0.4802567 -1.0931098 -2.166174 0.002291076
#> cls4:trtB -0.9343680 -0.8571119 -2.414451 0.385714623
#> cls5:trtB -1.0912380 -1.0671190 -2.656099 0.100235459
#> cls6:trtB -0.7027346 -1.1206126 -2.681735 -0.050294720
#> cls7:trtB -0.8126754 -1.3489945 -2.919291 0.123471857
#> cls8:trtB -0.7190897 -1.2683630 -2.969423 0.316759560
#> cls9:trtB -0.7692254 -1.2075415 -2.679188 0.094227192
#fit$p[, 1:4]
The coefficients of the interaction term cls\(\ast\):trtB represent the effects of the treatment B versus A in a cell-type (cls\(\ast\)). We can make a comparison of the treatment B versus A within all cell-types by constructing a contrast by summing the treatment effects across cell-types as follows.
contrast <- cbind("BvsA" = numeric(nrow(fit$coef)))
index <- grep(":", rownames(fit$coef))
contrast[index, ] <- 1
##Test the contrast.
test <- lmmtest(fit, contrast = contrast)
head(test)
#> BvsA_coef BvsA_t BvsA_p
#> Gene1284 -0.5737701 -0.7894025 0.429878675
#> Gene724 -1.3981410 -1.2095726 0.226445815
#> Gene774 -1.7344350 -2.7313493 0.006308665
#> Gene1640 0.1737108 0.1126195 0.910332422
#> Gene1605 -0.1172052 -0.1236708 0.901576156
#> Gene591 -0.7280169 -0.9918408 0.321277571
The contrast can also be constructed by contrast.matrix function as follows.
BvsA <- paste0(paste0("cls", 1:10, ":trtB"), collapse = "+")
contrast <- contrast.matrix(contrast = c(BvsA = BvsA), model.matrix.names = rownames(fit$coef))
test <- lmmtest(fit, contrast = contrast)
head(test)
#> BvsA_coef BvsA_t BvsA_p
#> Gene1284 -0.5737701 -0.7894025 0.429878675
#> Gene724 -1.3981410 -1.2095726 0.226445815
#> Gene774 -1.7344350 -2.7313493 0.006308665
#> Gene1640 0.1737108 0.1126195 0.910332422
#> Gene1605 -0.1172052 -0.1236708 0.901576156
#> Gene591 -0.7280169 -0.9918408 0.321277571
##(1) Check the genes for which LMM fitting converges.
gc <- (apply(abs(fit$dlogL), 2, max) < epsilon)
sum(gc)
#> [1] 100
##(2) Hypothesis testing for variance components:
## H0, theta </= 0 vs H1, theta > 0.
z <- fit$theta["var1", ]/fit$se.theta["var1", ]
p <- pnorm(z, lower.tail = FALSE)
sum(p < 0.05)
#> [1] 100
##(3) The DE genes specific to a cell-type
##Coefficients, t-values, and p-values for the genes specific to a cell-type.
index <- grep(":", rownames(fit$coef))
ce <- fit$coef[index, gc]
tv <- fit$t[index, gc]
pv <- fit$p[index, gc]
out <- data.frame(
gene = rep(colnames(ce), nrow(ce)),
cluster = rep(rownames(ce), each = ncol(ce)),
coef = c(t(ce)), t = c(t(tv)), p = c(t(pv)))
##Adjust p-values by FDR.
out$FDR <- p.adjust(out$p, method = "fdr")
##The DE genes with FDR < 0.05 and abs(logFC) > 0.5
out <- out[order(out$p), ]
rownames(out) <- NULL
out[(out$FDR < 0.05) & (abs(out$coef) > 0.5) , ]
#> gene cluster coef t p FDR
#> 1 Gene1474 cls9:trtB 0.7522080 11.847221 2.338315e-32 2.338315e-29
#> 2 Gene1621 cls3:trtB 0.7202221 10.385943 2.959496e-25 1.479748e-22
#> 3 Gene1225 cls1:trtB 1.0430644 9.420141 4.596390e-21 1.532130e-18
#> 4 Gene1480 cls3:trtB 0.8160513 6.528223 6.687145e-11 1.671786e-08
#> 5 Gene864 cls10:trtB 0.5329352 5.805523 6.435866e-09 1.287173e-06
#> 6 Gene1723 cls7:trtB -0.6311016 -5.444817 5.198041e-08 8.663402e-06
#> 7 Gene1537 cls1:trtB -0.6764918 -5.195038 2.050789e-07 2.929699e-05
##The true DE genes
#dat$DEgenes
##Fitting LMM by ML method
fit <- lmmfit(Y, X, Z, d = d, method = "ML", max.iter = max.iter, epsilon = epsilon)
rm(X, Y, Z)
##The DE genes specific to a cell-type
##Coefficients, t-values, and p-values
index <- NULL
index <- grep(":", rownames(fit$coef))
ce <- fit$coef[index, gc]
tv <- fit$t[index, gc]
pv <- fit$p[index, gc]
out <- NULL
out <- data.frame(
gene = rep(colnames(ce), nrow(ce)),
cluster = rep(rownames(ce), each = ncol(ce)),
coef = c(t(ce)), t = c(t(tv)), p = c(t(pv)))
##Adjusting p-values by FDR
out$FDR <- p.adjust(out$p, method = "fdr")
##The DE genes with FDR < 0.05 and abs(logFC) > 0.5
out <- out[order(out$p), ]
rownames(out) <- NULL
out[(out$FDR < 0.05) & (abs(out$coef) > 0.5) , ]
#> gene cluster coef t p FDR
#> 1 Gene1474 cls9:trtB 0.7522108 12.317523 7.727225e-35 7.727225e-32
#> 2 Gene1621 cls3:trtB 0.7202211 10.806948 3.303671e-27 1.651835e-24
#> 3 Gene1225 cls1:trtB 1.0430712 9.803126 1.117563e-22 3.725210e-20
#> 4 Gene1480 cls3:trtB 0.8160486 6.799494 1.055741e-11 2.639352e-09
#> 5 Gene864 cls10:trtB 0.5329290 6.039966 1.546882e-09 3.093765e-07
#> 6 Gene1723 cls7:trtB -0.6310985 -5.670709 1.425981e-08 2.376635e-06
#> 7 Gene1537 cls1:trtB -0.6764864 -5.411547 6.262584e-08 8.946549e-06
##
sessionInfo()
#> R version 4.4.3 (2025-02-28)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Monterey 12.7.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/Toronto
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] FLASHMM_1.2.0
#>
#> loaded via a namespace (and not attached):
#> [1] Matrix_1.7-3 miniUI_0.1.1.1 jsonlite_1.9.1 compiler_4.4.3
#> [5] promises_1.3.2 Rcpp_1.0.14 callr_3.7.6 later_1.4.1
#> [9] jquerylib_0.1.4 yaml_2.3.10 fastmap_1.2.0 lattice_0.22-6
#> [13] mime_0.13 R6_2.6.1 curl_6.2.1 knitr_1.50
#> [17] MASS_7.3-65 htmlwidgets_1.6.4 desc_1.4.3 profvis_0.4.0
#> [21] shiny_1.10.0 bslib_0.9.0 rlang_1.1.5 cachem_1.1.0
#> [25] httpuv_1.6.15 xfun_0.51 fs_1.6.5 sass_0.4.9
#> [29] pkgload_1.4.0 memoise_2.0.1 cli_3.6.4 magrittr_2.0.3
#> [33] ps_1.9.0 grid_4.4.3 processx_3.8.6 digest_0.6.37
#> [37] rstudioapi_0.17.1 xtable_1.8-4 remotes_2.5.0 devtools_2.4.5
#> [41] lifecycle_1.0.4 vctrs_0.6.5 evaluate_1.0.3 glue_1.8.0
#> [45] urlchecker_1.0.1 sessioninfo_1.2.3 pkgbuild_1.4.6 rmarkdown_2.29
#> [49] purrr_1.0.4 tools_4.4.3 usethis_3.1.0.9000 ellipsis_0.3.2
#> [53] htmltools_0.5.8.1
Building design matrices for fixed and random effects is a key step in LMM-based differential expression analysis. Including library size, a normalization factor for scRNA-seq, as a fixed effect can help reduce p-value inflation. If necessary, we can add principal components as fixed effects to further mitigate unknown batch effects.
In differential expression analysis, modeling samples (subjects) as random effects accounts for between-sample variability and within-sample correlation. If samples have different effects on gene expression, we can also model them as fixed effects. However, this can lead to overfitting when the number of samples (subjects) is large. For scRNA-seq data, this may also result in collinearity when the samples are nested within an experimental condition.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.