The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
Microbiome sequencing produces compositional data with with many taxa having zero counts. Many methods of analyzing microbiome data replace a zero count by a pseudocounts, which can produce poor results.
The CAFT package implements a rank-based compositional analysis using an Accelerated Failure Time (AFT) modeling framework tailored for microbiome data. Zero read counts are treated as censored observations below a detection limit given by the library size (total counts for a sample). On the log-relative-abundance scale, the analysis becomes an implementation of the accelerated failure time model (AFT) in survival analysis. CAFT uses a rank-based method to fit the AFT which does not make model error distributional assumptions. CAFT supports taxon-level differential abundance testing for binary or multi-category phenotypes, with optional adjustment for continuous and categorical covariates.
CAFT fits a log-linear model to values of \(-\text{log}(\widehat{p}_{ij})\) where \(\widehat{p}_{ij}\) is the relative abundance of taxon \(j\) in sample \(i\). When \(\widehat{p}_{ij}=0\), it is considered to be left-censored at \(\widehat{p}_{ij}=1/N_i\) where \(N_i\) is the library size for the \(i\)th sample. Defining \(\Delta_{ij}=I(\widehat{p}_{ij}>0)\) and defining the right-censored “failure times” \(t_{ij}=-\text{log}(\widehat{p}_{ij})\) when \(\Delta_{ij}=1\) and \(t_{ij}=-\text{log}(1/N_i)=\text{log}(N_i)\) when \(\Delta_{ij}=0\), CAFT fits the model \[-t_{ij} =\alpha_j + x_{i\cdot} \beta_j + \epsilon_{ij}\] to relative abundances of each taxon \(j\), where \(x_{i \cdot}\) is the \(i\)th row of the data matrix \(x\). This model corresponds to the `Accelerated Failure Time (AFT) Model’ in survival analysis. CAFT uses a rank-based estimation procedure to fit this model that makes no assumptions on the distribution of \(\epsilon_{ij}\).
CAFT is designed to test hypotheses of the form \[\Gamma \beta = b\] in one of two ways. The easiest way is to write \[x=(x_{test},x_{adj})\] where \(x_{test}\) has as its columns the variables we wish to test, while \(x_{adj}\) has as its columns variables we wish to adjust for. If \(x_{test}\) has \(d\) columns and \(x_{adj}\) has \(m\) columns then CAFT uses the “default choice” \(\Gamma=(I_{d \times d},0_{m \times m})\) where \(I\) is the identity matrix and \(0\) is a matrix with all zeroes. The “default choice” for \(b\) is \(b=\Gamma \widehat{\beta}^{(m)}\) where \(\widehat{\beta}^{(m)}\) is the median (over taxa) of the \(\widehat{\beta}_j\) values. This gives a \(d\)-degree-of-freedom test of \(\beta_1=b_1, \beta_2=b_2,\cdots,\beta_d=b_d\). Alternatively, or for more complex tests, the user may specify their own values of \(\Gamma\) and \(b\) as inputs.
Install from the local file (the .tar.gz file), which you can download from https://github.com/mli171/CAFT.
You can also install the developer version from github via the following code.
The main function in CAFT package is
caft(
otu.table,
x.test = NULL,
x.adj = NULL,
x = NULL,
Gamma = NULL,
b = NULL,
filter.thresh = 0.05,
fdr.nominal = 0.20,
adjust.method = "BH",
regularize = TRUE,
test.method = "rank",
n.cores = 1L,
boot.B = 0L,
boot.parallel = FALSE,
boot.n.cores = 1L,
boot.seed = NULL,
boot.return.dist = FALSE,
verbose = FALSE,
return.mr.resid = FALSE
)The input parameters are: otu.table, a matrix of read counts with samples as rows and taxa as columns; x.test, a data matrix with samples as rows and columns corresponding to the variables we want to test; x.adj, a data matrix with samples as rows and columns corresponding to the variables we want to control for. Inputs x, Gamma and b will be explained later. CAFT filters out taxa that occur in proportionately fewer than filter.thresh samples (the default is 5%), and fdr.nominal is the desired nominal false-discovery rate. regularize adds a small quadratic penalty to the objective function to avoid infinite parameter estimates when there is complete separation of the data; its contribution is an order of nrow(otu.table) smaller than the solution when it is not required and can therefore be ignored.
The use of CAFT and the other input parameters for special cases are described below by example in the sections that follow, using two example data sets for illustration.
The CAFT package includes filtered data from two
studies: URT (upper respiratory tract), originally from [1]
and obtained from the R package LOCOM [2]; and
Colon from the R package
curatedMetagenomicData [3] (which can be installed
using Bioconductor). These processed microbiome data
sets can be loaded directly from CAFT package.
The URT data are a subset of data from a study of the microbiome of the upper respiratory tract [1], also used in the LOCOM package [2]. Following [2], we used only left oropharyngeal observations, and followed the same filtering steps as in the LOCOM paper [2]. We excluded three samples from individuals who had used antibiotics within three months prior to sample collection, and subset both the OTU count and taxonomy tables accordingly. The filtered OTU table have 57 rows (samples) and 856 columns (taxa), and the average censoring rate (the mean proportion of zero counts across taxa) is 89.35%.
The Colon data in CAFT are described in
the R package curatedMetagenomicData [3]. For our
analyses here, we retain only adult participants, stool specimens, and
samples with disease status labeled as colorectal cancer (CRC), adenoma,
or healthy.
We use the URT microbiome data set to illustrate the simplest use of CAFT for a single binary outcome. We first load the data and apply the filters described previously.
data("URT")
throat.otu.table <- URT$otu
throat.meta <- URT$meta
throat.otu.taxonomy <- URT$tax
filter.out.sam = which(throat.meta$AntibioticUsePast3Months_TimeFromAntibioticUsage != "None")
throat.otu.table.filter = throat.otu.table[-filter.out.sam,]
throat.meta.filter = throat.meta[-filter.out.sam,]
dim(throat.otu.table.filter)
#> [1] 57 856
cens.prop = colMeans(throat.otu.table.filter == 0, na.rm = T)
mean(cens.prop)
#> [1] 0.8935276Setting filter.thresh = 0.10 and
fdr.nominal = 0.10, we conduct the taxa differential
abundant analysis by entering \(x_{test}\) and \(x_{adj}\) as separate arguments, using
x.test = ifelse(throat.meta.filter$SmokingStatus == "NonSmoker", 0, 1)
x.adj = ifelse(throat.meta.filter$Sex == "Male", 0, 1)
res.CAFT.throat = caft(otu.table=throat.otu.table.filter, x.test=x.test, x.adj=x.adj,
filter.thresh = 0.10, fdr.nominal = 0.10)With the default multiple comparison adjustment method of “BH”, we can determine the significantly differential abundant taxa at the nominal FDR level as shown below.
If we want to see the p- or q-values for these taxa, we can use the code (only first 5 shown)
res.CAFT.throat$p.otu[res.CAFT.throat$q.detected.otu][1:5]
#> 4440 2831 2434 4363 1490
#> 1.975870e-03 2.776166e-03 1.604953e-03 2.861630e-03 2.939009e-05Note these are in the order that the taxa are given in the otu table.
Note also that discoveries at different FDR thresholds can be obtained by simply selecting values of res.CAFT.throat$q.otu < FDR threshold.
Setting the grouping label of each sample, we can obtain a
ggplot2-based boxplot showing the relative abundance of a specified OTU
using the function otuboxplot() in our
CAFT package.
groups = interaction(x.test, x.adj, sep = "_", drop = TRUE)
groups = factor(groups,
labels = c("Non-Smoker\nMale", "Smoker\nMale",
"Non-Smoker\nFemale", "Smoker\nFemale"))
boxplot.otu = "2831"
throat.otu.taxonomy[which(colnames(throat.otu.table.filter) == boxplot.otu),]
#> [1] "Root;Bacteria;Bacteroidetes;Bacteroidetes;Bacteroidales;Prevotellaceae;Prevotella"
otuboxplot(plot.otu=boxplot.otu, count.data=throat.otu.table.filter,
plot.title = "<i>Prevotellaceae Prevotella</i>",groups=groups)We use the Colon data to illustrate multivariate covariates such as multigroup categorial data. Here we also use the phyloseq package to extract the data: note that phyloseq is not required to run CAFT.
age_category == "adult"body_site == "stool"disease %in% c("CRC", "healthy", "adenoma")The filtered OTU table have 856 rows (samples) and 935 columns (taxa). For convenience of display in this vignette, we also set the names of each taxon to its column number.
data(Colon)
count.tab = Colon$otu
sample.tab = Colon$meta
tax.tab = Colon$tax
dim(count.tab)
#> [1] 856 935
colnames(count.tab)=1:ncol(count.tab)We also removed samples with missing age values (no samples were missing gender) and then filtered the dataset to retain only taxa present in more than one sample (i.e., taxa with positive counts in at least two samples), subsetting both the count and taxonomy tables accordingly. After this additional filtering, the OTU table has 849 rows (samples) and 708 columns (taxa), and the average censoring rate (the mean proportion of zero counts across taxa) is 87.20%.
pNA = which(is.na(sample.tab$age))
if(length(pNA) > 0){
count.tab = count.tab[-pNA, ]
sample.tab = sample.tab[-pNA,]
}
# No samples have missing values for gender
## otu presence filtering
p_otu = which(rowSums(t(count.tab) > 0) > 1)
count.tab = count.tab[,p_otu]
tax.tab = tax.tab[p_otu,]
dim(count.tab)
#> [1] 849 708
cens.prop = colMeans(count.tab == 0, na.rm = T)
mean(cens.prop)
#> [1] 0.8720479Then we put the covariates we wish to test into the columns of
x.test, setting “healthy” as the reference level and
including two indicators for “CRC” and “adenoma”. We also put the
covariates we wish to adjust for (the variables “Age” and “Gender”) into
the columns of x.adj.
Disease1 = Disease2 = rep(0, NROW(sample.tab)) # healthy
Disease1[sample.tab$disease == "CRC"] = 1
Disease2[sample.tab$disease == "adenoma"] = 1
Age = as.numeric(sample.tab$age)
Gender = as.numeric(factor(sample.tab$gender)) - 1
x.test = cbind(CRC = Disease1, adenoma = Disease2)
x.adj = cbind(Age = Age, Gender = Gender)We can now run CAFT exactly as in the univariate example:
res.CAFT.Colon = caft(otu.table = count.tab, x.test=x.test, x.adj=x.adj, filter.thresh = 0.10, fdr.nominal = 0.10)
res.CAFT.Colon$Gamma
#> [,1] [,2] [,3] [,4]
#> [1,] 1 0 0 0
#> [2,] 0 1 0 0
res.CAFT.Colon$q.detected.otu
#> [1] "1" "2" "3" "5" "7" "8" "9" "11" "12" "13" "14" "16"
#> [13] "17" "18" "19" "21" "22" "24" "25" "27" "28" "29" "31" "32"
#> [25] "34" "36" "38" "48" "49" "54" "60" "64" "74" "79" "80" "82"
#> [37] "83" "87" "94" "99" "112" "113" "123" "140" "153" "163" "172" "174"
#> [49] "195" "196" "216" "222" "228" "243" "252" "306" "312" "331" "381"We see that 59 taxa were detected at the FDR=0.1 nominal level. If we want to see the p- or q-values for these taxa, we can use the code (only first 10 shown)
res.CAFT.Colon$p.otu[res.CAFT.Colon$q.detected.otu][1:10]
#> 1 2 3 5 7 8
#> 1.098651e-21 5.461108e-08 6.405993e-07 5.423462e-03 2.634203e-03 4.515287e-08
#> 9 11 12 13
#> 5.357031e-06 1.454428e-09 1.506928e-06 1.835382e-02Again, note these are in the order the taxa appear in the otu table, not the size of the p-value.
The CAFT analysis can also be carried out in two separate steps. The advantage of this approach is computational, and is explained after we describe the two steps.
For the first step, we use caft_estimate to obtain
the \(\widehat{\beta}_j\)s, the
unrestricted estimates of parameters \(\beta_j\). Because no hypothesis is
specified, data should be entered using the full data matrix x. If data
are stored as x_test' andx_adj’, then simply combine them
using `x=cbind(x_test,x_adj)’.
est.CAFT = caft_estimate(otu.table=count.tab, x=cbind(x.test, x.adj), filter.thresh=0.10, regularize=TRUE, n.cores=1L)
print(est.CAFT)
#> CAFT unrestricted estimation
#> Samples: 849
#> Taxa: 708
#> Covariates: 4
#> Taxa used: 217
#> Taxa skipped by rarity filter: 491
#> Taxa failed in unrestricted fit: 0
head(est.CAFT$beta.est)
#> b1.est b2.est b3.est b4.est
#> 1 0.12129121 0.10417073 0.0009675841 0.01772128
#> 2 0.14476170 -0.08198201 -0.0127392812 0.12949872
#> 3 0.08282945 0.05023630 -0.0039049851 0.08690792
#> 4 -0.31959266 0.17216115 0.0032853348 -0.54353098
#> 5 -0.15146946 -0.08585543 -0.0005585079 -0.09654635
#> 6 -0.08010923 0.18089992 0.0128099367 0.22558895For the second step, we specify a hypothesis we want to test, either
by specifying values of x.test and x.adj or by
specifying Gamma and b. Using these inputs,
caft_test will calculate the restricted parameter
estimates \(\widehat{\beta}_j^{(r)}\)
that satisfy the null hypothesis \(\Gamma
\widehat{\beta}_j^{(r)}=b\), and conduct the hypothesis test to
determine whether to accept or reject the hypothesis \(\Gamma \beta=b\). Since the unrestricted
estimates are the same regardless of the null hypothesis we wish to test
(as long as the variables in x are not changed), the
multi-step analysis can be more efficient. In the example below, we test
the hypothesis \(\Gamma \beta = b\).
Because we do not specify \(b\), the
default choice \(b=\text{median}(\Gamma
\widehat{\beta})\) is used, where \(\widehat{\beta}\) are the unrestricted
estimators obtained from caft_estimate.
For example, we can test the significance of the two coefficients corresponding to disease category, controlling for age and gender using
Gamma=cbind(diag(2), matrix(0,nrow=2, ncol=2) )
Gamma
#> [,1] [,2] [,3] [,4]
#> [1,] 1 0 0 0
#> [2,] 0 1 0 0which we can see is the same as
from the previous section. Then, we run caft_test to finish the analysis:
test.CAFT = caft_test(est.CAFT, Gamma=Gamma, fdr.nominal=0.10, adjust.method="BH")
print(test.CAFT)
#> CAFT restricted score test
#> Samples: 849
#> Taxa: 708
#> Covariates: 4
#> Tested contrasts: 2
#> FDR level: 0.1
#> Adjustment method: BH
#> Taxa with p < 0.1 : 89
#> Taxa with q < 0.1 : 59
#> Taxa failed in restricted test: 0
test.CAFT$betahat.median
#> [1] -0.33370892 0.06554493
test.CAFT$b.null
#> [1] -0.33370892 0.06554493
test.CAFT$q.detected.otu
#> [1] "1" "2" "3" "5" "7" "8" "9" "11" "12" "13" "14" "16"
#> [13] "17" "18" "19" "21" "22" "24" "25" "27" "28" "29" "31" "32"
#> [25] "34" "36" "38" "48" "49" "54" "60" "64" "74" "79" "80" "82"
#> [37] "83" "87" "94" "99" "112" "113" "123" "140" "153" "163" "172" "174"
#> [49] "195" "196" "216" "222" "228" "243" "252" "306" "312" "331" "381"
test.CAFT$p.otu[test.CAFT$q.detected.otu][1:10]
#> 1 2 3 5 7 8
#> 1.098651e-21 5.461108e-08 6.405993e-07 5.423462e-03 2.634203e-03 4.515287e-08
#> 9 11 12 13
#> 5.357031e-06 1.454428e-09 1.506928e-06 1.835382e-02These results should be identical to those obtained in the previous section.
The advantage of the separate analysis path is that we can now test for a difference in the effect of the disease categories CRC and adenoma without having to run caft_estimate again (i.e., using the same unrestricted null estimates), by executing the following steps:
Gamma=matrix( c(1,-1,0,0), nrow=1)
test.CAFT.2 = caft_test(est.CAFT, Gamma=Gamma, fdr.nominal=0.20, adjust.method="BH")
print(test.CAFT.2)
#> CAFT restricted score test
#> Samples: 849
#> Taxa: 708
#> Covariates: 4
#> Tested contrasts: 1
#> FDR level: 0.2
#> Adjustment method: BH
#> Taxa with p < 0.2 : 69
#> Taxa with q < 0.2 : 39
#> Taxa failed in restricted test: 0
test.CAFT.2$betahat.median
#> [1] -0.4017934
test.CAFT.2$b.null
#> [1] -0.4017934
test.CAFT.2$q.detected.otu
#> [1] "1" "2" "3" "5" "7" "9" "11" "14" "16" "17" "18" "19"
#> [13] "24" "27" "28" "29" "60" "79" "80" "87" "94" "99" "112" "116"
#> [25] "123" "129" "140" "172" "195" "196" "200" "203" "216" "222" "252" "312"
#> [37] "331" "372" "381"
length(test.CAFT.2$q.detected.otu)
#> [1] 39
test.CAFT.2$p.otu[test.CAFT.2$q.detected.otu][1:10]
#> 1 2 3 5 7 9
#> 7.506836e-08 3.769040e-05 5.299649e-03 8.241583e-03 1.473945e-02 6.265867e-03
#> 11 14 16 17
#> 3.485724e-04 2.784984e-03 2.540297e-03 5.011159e-04We see that there are 39 taxa at which there is evidence that \(\beta_1=\beta_{\text{CRC}}\) is different from \(\beta_2=\beta_{\text{adenoma}}\).
Note that, if the x.test and x.adj input
format is used in caft_test, the order of the variables
must be the same as was used when running
caft_estimate.
The object test.CAFT$betahat.median gives the value of
\(\text{median}(\Gamma
\widehat{\beta})\), default value of \(b\), the median-based compositional
reference value. The user may specify a different value of \(b\) (e.g., for sensitivity analyses or to
choose a different reference, such as the mode of \(\Gamma \widehat{\beta}\) rather than the
median), which is returned in test.CAFT$b.null.
We can also implement multi-threads parallel computing to speed up
the computation by setting the values of n.cores based on
available computation resources.
CAFT also provides bootstrap version to calculate taxon-level
p-values when either the number of samples or the number of taxa is
small. The bootstrap calculation can be requested by setting
boot.B >= 2. In practice, a larger number of bootstrap
replicates, such as boot.B = 1000, is recommended, as the
smallest non-zero \(p\)-value that can
be obtained is 1/(boot.B+1). Note that the bootstrap version of CAFT can
be very slow.
res.CAFT.boot = caft(otu.table=count.tab, x.test=x.test, x.adj=x.adj, boot.B=1000, boot.seed=1)
res.CAFT.boot$q.boot.detected.otu
res.CAFT.boot$q.boot.chi.detected.otuBootstrap calibration can also be parallelized over bootstrap replicates. The following code is not evaluated in this vignette because the optimal number of cores depends on the user’s computing environment.
If you use the CAFT package or the CAFT method in your work, please cite:
Satten, G. A., Li, M., & Zhao, N. (2025). CAFT: A Compositional Log-Linear Model for Microbiome Data with Zero Cells. bioRxiv, 2025.11.26.690468. https://doi.org/10.1101/2025.11.26.690468
A BibTeX entry is:
[1] Charlson, E. S., Chen, J., Custers-Allen, R., et al. Disordered microbial communities in the upper respiratory tract of cigarette smokers. PloS one, 5(12), e15216. https://doi.org/10.1371/journal.pone.0015216
[2] Hu, Y., Satten, G. A., & Hu, Y. J. LOCOM: A logistic regression model for testing differential abundance in compositional microbiome data with false discovery rate control. Proceedings of the National Academy of Sciences, 119(30), e2122788119 (2022). https://doi.org/10.1073/pnas.2122788119
[3] Pasolli, E., Schiffer, L., Manghi, P., et al. Accessible, curated metagenomic data through ExperimentHub. Nature Methods, 14(11), 1023–1024 (2017). https://doi.org/10.1038/nmeth.4468
[4] Satten, G. A., Li, M., & Zhao, N. CAFT: A Compositional Log-Linear Model for Microbiome Data with Zero Cells. bioRxiv, 2025.11.26.690468 (2025). https://doi.org/10.1101/2025.11.26.690468
sessionInfo()
#> R version 4.5.0 (2025-04-11)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 26.5.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/Chicago
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] CAFT_0.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.10 generics_0.1.4 xml2_1.5.2 stringi_1.8.7
#> [5] lattice_0.22-9 digest_0.6.39 magrittr_2.0.5 evaluate_1.0.5
#> [9] grid_4.5.0 RColorBrewer_1.1-3 iterators_1.0.14 mvtnorm_1.4-1
#> [13] fastmap_1.2.0 foreach_1.5.2 doParallel_1.0.17 jsonlite_2.0.0
#> [17] Matrix_1.7-5 DBI_1.3.0 survival_3.8-6 ggtext_0.1.2
#> [21] doRNG_1.8.6.3 scales_1.4.0 codetools_0.2-20 jquerylib_0.1.4
#> [25] cli_3.6.6 mitools_2.4 rlang_1.2.0 expm_1.0-0
#> [29] litedown_0.9 splines_4.5.0 commonmark_2.0.0 withr_3.0.3
#> [33] cachem_1.1.0 yaml_2.3.12 otel_0.2.0 tools_4.5.0
#> [37] parallel_4.5.0 dplyr_1.1.2 ggplot2_4.0.3 rngtools_1.5.2
#> [41] vctrs_0.7.3 R6_2.6.1 lifecycle_1.0.5 stringr_1.6.0
#> [45] MASS_7.3-65 pkgconfig_2.0.3 bslib_0.11.0 pillar_1.11.1
#> [49] gtable_0.3.6 glue_1.8.1 Rcpp_1.1.1-1.1 xfun_0.59
#> [53] tibble_3.3.1 tidyselect_1.2.1 rstudioapi_0.18.0 knitr_1.51
#> [57] farver_2.1.2 survey_4.5 htmltools_0.5.9 rmarkdown_2.31
#> [61] labeling_0.4.3 compiler_4.5.0 ICSNP_1.1-3 S7_0.2.2
#> [65] ICS_1.4-2 markdown_2.0 gridtext_0.1.6These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.