Introduction to IFAA

IFAA is a novel approach for making inference on the association of covariates with the absolute abundance (AA) of microbiome taxa in an ecosystem. It can also be applied directly to relative abundance (RA) data to make inference on AA, because the ratio of two RAs is equal to the ratio of their AAs. The algorithm estimates and tests the associations of interest while adjusting for potential confounders. High-dimensional covariates are handled with regularization. The estimates are as easy to interpret as those of a typical regression analysis. The algorithm finds optimal reference taxa/OTUs/ASVs and controls the FDR by permutation.

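To model the association, the following equation is used: \[ \log(\mathcal{Y}_i^k)|\mathcal{Y}_i^k>0=\beta^{0k}+X_i^T\beta^k+W_i^T\gamma^k+Z_i^Tb_i+\epsilon_i^k,\hspace{0.2cm}k=1,\ldots,K+1, \] where \(\mathcal{Y}_i^k\) is the AA of taxon \(k\) in subject \(i\) in the entire ecosystem, \(X_i\) contains the covariates of interest (testCov), \(W_i\) contains the potential confounders (ctrlCov), \(Z_i\) is the design matrix for random effects, and \(\beta^k\) is the vector of regression coefficients that IFAA() estimates and tests.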

The challenge in microbiome analysis is that we cannot observe \(\mathcal{Y}_i^k\). What is observed is a small proportion of it: \(Y_i^k=C_i\mathcal{Y}^k_i\), where \(C_i\) is an unknown number between 0 and 1 that denotes the observed proportion. The IFAA method handles this challenge by identifying and employing reference taxa.
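The reason a reference taxon helps can be checked numerically: within a sample, the unknown factor \(C_i\) cancels in any ratio of two taxa. The sketch below uses made-up abundances and made-up values of \(C_i\) (none of these numbers come from the package) and only illustrates the cancellation; it is not the IFAA algorithm itself.

```r
# Hypothetical absolute abundances (AA), 3 samples x 4 taxa; not package data.
AA <- matrix(c(1200, 300,  50, 800,
               4000, 900, 120, 100,
                600,  60,  30, 240),
             nrow = 3, byrow = TRUE,
             dimnames = list(paste0("sample", 1:3), paste0("taxon", 1:4)))

# Hypothetical per-sample observed proportions C_i, each between 0 and 1.
C <- c(0.010, 0.002, 0.050)

# Observed abundances Y_i^k = C_i * AA_i^k (row i is scaled by C[i]).
Y <- AA * C

# Relative abundances computed from the observed data.
RA <- Y / rowSums(Y)

# Ratio of taxon1 over a reference taxon on all three scales.
ref <- "taxon4"
ratio_AA <- AA[, "taxon1"] / AA[, ref]
ratio_Y  <- Y[, "taxon1"]  / Y[, ref]
ratio_RA <- RA[, "taxon1"] / RA[, ref]

print(cbind(ratio_AA, ratio_Y, ratio_RA))
```

The three columns agree for every sample, even though each sample has a different \(C_i\).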

Package installation

To install, type the following command in the R console:

install.packages("IFAA", repos = "http://cran.us.r-project.org")

The package can also be installed from GitHub using the following code:

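# Install devtools first if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("gitlzg/IFAA")
library(IFAA)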

Input and Output for IFAA() function

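The IFAA() function is the main function of the package. The user inputs used in this document are MicrobData, CovData, linkIDname, testCov, ctrlCov, nRef, nPermu, fwerRate and bootB; the Examples section below shows a call that sets them.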

The output of the IFAA() function is a list. The estimation results can be extracted from analysisResults$estByCovList.

The covariates data, including testCov and ctrlCov, can be extracted from covariatesData.

Examples

The example datasets dataM and dataC are included in the package. They can be loaded with:

library(IFAA)

data(dataM)
dim(dataM)
#> [1] 20 60
dataM[1:5, 1:8]
#>   id rawCount1 rawCount2 rawCount3 rawCount4 rawCount5 rawCount6 rawCount7
#> 1  1         0         0         0         0         0         3         0
#> 2  2         0         0         0         0         0         0         0
#> 3  3         0         0         0         0         0       214         0
#> 4  4         0         0         0         0         0         2         0
#> 5  5         0         0         0         0         0        40         0

data(dataC)
dim(dataC)
#> [1] 20  6
dataC[1:5, ]
#>   id v4       v1 v5 v2 v3
#> 1  1  1 1.653901  4  1 NA
#> 2  2  2 0.362706  5  2  2
#> 3  3  1 1.496269 NA  5  2
#> 4  4  1 1.755541  5  3  3
#> 5  5  1 1.035714  5  7 NA

Both the microbiome data dataM and the covariates data dataC contain 20 samples (i.e., 20 rows).

Next we analyze the data to test the association between microbiome and the two variables "v1" and "v2" while adjusting for the variable "v3".

results <- IFAA(MicrobData = dataM,
                CovData = dataC,
                linkIDname = "id",
                testCov = c("v1", "v2"),
                ctrlCov = c("v3"),
                nRef = 4,
                nPermu = 4,
                fwerRate = 0.25,
                bootB = 5)
#> There are41taxa without any sequencing reads and
#>         excluded from the analysis
#> Data dimensions (after removing missing data if any):
#> 13samples
#> 18OTU's or microbial taxa
#> 2testCov variables in the analysis
#> These are the testCov variables:
#> v1v2
#> 1ctrlCov variables in the analysis
#> These are the ctrlCov variables:
#> v3
#> 0binary covariates in the analysis
#> 54.27percent of microbiome sequencing reads are zero
#> Start Phase 1 association identification
#> start Original screen
#> 14parallel jobs are registered for analyzing4reference taxa in Phase 1a.
#> OriginDataScreen parallel setup took2.58seconds
#> Original screen done and took0.0491666666666667minutes
#> start to run permutation
#> 14parallel jobs are registered for the permutation analysis in Phase 1b.
#> Permutation done and took0.160666666666667minutes
#> Phase 1 Associaiton identification is done and used0.263166666666667minutes
#> Start Phase 2 parameter estimation
#> Final Reference Taxa are:rawCount9
#> Start estimation for the1th final reference taxon:rawCount9
#> 16parallel jobs are registered for bootstrp in Phase 2.
#> Estimation done for the1th final reference taxon:rawCount9and it took0.056minutes
#> Phase 2 parameter estimation done and took0.056minutes.
#> The entire analysis took0.319833333333333minutes

In this example, we are only interested in testing the association with "v1" and "v2", which is why testCov=c("v1","v2"). The variable "v3" is adjusted for as a potential confounder in the analyses. For the sake of speed in this hypothetical example, we set small values nRef=4, nPermu=4 and bootB=5. These are for illustration purposes only and are too small for a formal analysis to generate valid results.
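The log above reports that taxa without any sequencing reads were excluded and gives the percentage of zero reads. The following sketch shows this kind of filtering on a small made-up count table. The data and the variable names are hypothetical, and whether IFAA() computes the zero percentage before or after the filtering is not stated here, so this should not be read as the package's internal code.

```r
# Hypothetical counts, 4 samples x 5 taxa; not the package data.
counts <- data.frame(
  id     = 1:4,
  taxon1 = c(0, 0, 0, 0),
  taxon2 = c(3, 0, 214, 2),
  taxon3 = c(0, 0, 0, 0),
  taxon4 = c(0, 15, 0, 7),
  taxon5 = c(1, 0, 0, 0)
)

# Separate the taxa columns from the ID column.
taxa <- counts[, setdiff(names(counts), "id")]

# Taxa whose counts are zero in every sample carry no information.
no_reads <- colSums(taxa) == 0
kept <- taxa[, !no_reads, drop = FALSE]

# Percentage of zero entries among the retained taxa.
pct_zero <- round(100 * mean(as.matrix(kept) == 0), 2)

cat(sum(no_reads), "taxa without reads;",
    ncol(kept), "taxa kept;",
    pct_zero, "percent zeros\n")
```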

The final analysis results are stored in the list analysisResults$estByCovList:

results$analysisResults$estByCovList
#> $v2
#>              Beta.LPR LowB95%CI.LPR UpB95%CI.LPR
#> rawCount29 0.04045972   0.007860232   0.05096203
#> rawCount42 0.02472210  -0.030848352   0.05009310

The analysis identified two taxa, "rawCount29" and "rawCount42", as associated with "v2". The regression coefficients and their 95% confidence intervals are provided. These coefficients correspond to \(\beta^k\) in the model equation.
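Because the model is written for \(\log(\mathcal{Y}_i^k)\), a coefficient can be moved to the multiplicative scale by exponentiating it. The sketch below does this for the "v2" estimates, with the numbers copied from the printed output above; the data frame and its column names are introduced here for illustration and are not objects returned by the package.

```r
# Estimates for "v2" copied from the printed estByCovList output.
est_v2 <- data.frame(
  taxon = c("rawCount29", "rawCount42"),
  beta  = c(0.04045972, 0.02472210),
  lower = c(0.007860232, -0.030848352),
  upper = c(0.05096203, 0.05009310)
)

# A one-unit increase in v2 multiplies the AA by exp(beta),
# which is a change of 100 * (exp(beta) - 1) percent.
to_pct <- function(b) 100 * (exp(b) - 1)

est_v2$pct_change <- to_pct(est_v2$beta)
est_v2$pct_lower  <- to_pct(est_v2$lower)
est_v2$pct_upper  <- to_pct(est_v2$upper)

print(est_v2[, c("taxon", "pct_change", "pct_lower", "pct_upper")])
```

For coefficients this small, the percent change is close to 100 times the coefficient itself.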

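The interpretation is that every unit increase in "v2" is associated with an approximately 4.1% increase in the absolute abundance of "rawCount29" (\(e^{0.0405}\approx 1.041\)) and an approximately 2.5% increase in the absolute abundance of "rawCount42" (\(e^{0.0247}\approx 1.025\)) in the entire ecosystem of interest.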

All the analyzed covariates, including testCov and ctrlCov, are stored in the object covariatesData:

results$covariatesData
#>    id          v1 v2  v3
#> 2   2  0.36270596  2   2
#> 3   3  1.49626921  5   2
#> 4   4  1.75554095  3   3
#> 6   6  1.64525227  4   4
#> 8   8 -1.57781131 24  22
#> 9   9  2.22581203 55   5
#> 10 10  0.71642615 98  67
#> 12 12  2.12230160 98   3
#> 14 14  1.99387922 93   4
#> 16 16  0.05417617 83  34
#> 18 18 -0.43426021 73  67
#> 19 19  1.46579846 68 566
#> 20 20  1.89625949 63  34
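Only 13 of the 20 samples appear in covariatesData. One plausible reason, consistent with the log line about removing missing data, is that samples with a missing value in any analyzed covariate are dropped. The sketch below applies that rule to the first five rows of dataC as printed earlier. It is an assumption about the behavior rather than the package's internal code, and samples may also be excluded for other reasons.

```r
# First five rows of dataC, copied from the printout earlier in this document.
dataC_head <- data.frame(
  id = 1:5,
  v4 = c(1, 2, 1, 1, 1),
  v1 = c(1.653901, 0.362706, 1.496269, 1.755541, 1.035714),
  v5 = c(4, 5, NA, 5, 5),
  v2 = c(1, 2, 5, 3, 7),
  v3 = c(NA, 2, 2, 3, NA)
)

# Covariates used in the example call: testCov plus ctrlCov.
used <- c("v1", "v2", "v3")

# Keep only rows with no missing value among the covariates that are used.
keep <- complete.cases(dataC_head[, used])

print(dataC_head[keep, c("id", used)])
```

Samples 1 and 5 are dropped because "v3" is missing, while sample 3 is kept because its missing value is in "v5", which is not part of the analysis. This matches the IDs 2, 3 and 4 at the top of covariatesData.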

MZILN() function

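The IFAA package also offers the MZILN() function to implement the Multivariate Zero-Inflated Logistic Normal regression model for analyzing microbiome data. The regression model for MZILN() can be expressed as follows: \[ \log\bigg(\frac{\mathcal{Y}_i^k}{\mathcal{Y}_i^{K+1}}\bigg)|\mathcal{Y}_i^k>0,\mathcal{Y}_i^{K+1}>0=\alpha^{0k}+\mathcal{X}_i^T\alpha^k+\epsilon_i^k,\hspace{0.2cm}k=1,\ldots,K, \] where \(\mathcal{Y}_i^k\) is the AA of taxon \(k\) in subject \(i\), \(\mathcal{Y}_i^{K+1}\) is the AA of the user-specified reference taxon, \(\mathcal{X}_i\) contains all covariates (allCov), and \(\alpha^k\) is the vector of regression coefficients that MZILN() estimates and tests.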

Input and Output for MZILN() function

The MZILN() function implements the Multivariate Zero-Inflated Logistic Normal model. It estimates and tests the associations given a user-specified reference taxon/OTU/ASV, whereas IFAA() does not require any user-specified reference taxa. If the user-specified taxon is independent of the covariates, MZILN() should generate results similar to those of IFAA(). The user inputs used in this document are MicrobData, CovData, linkIDname, allCov and refTaxa; the Examples section below shows a call that sets them.

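The output of the MZILN() function is a list. The estimation results for a given reference taxon can be extracted from analysisResults$estByRefTaxaList, under the name of that reference taxon, in the element estByCovList.

All covariates data can be extracted from covariatesData.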

Examples

We use the same example data as for illustrating the IFAA() function. The datasets dataM and dataC are included in the package. They can be loaded with:

data(dataM)
dim(dataM)
#> [1] 20 60
dataM[1:5, 1:8]
#>   id rawCount1 rawCount2 rawCount3 rawCount4 rawCount5 rawCount6 rawCount7
#> 1  1         0         0         0         0         0         3         0
#> 2  2         0         0         0         0         0         0         0
#> 3  3         0         0         0         0         0       214         0
#> 4  4         0         0         0         0         0         2         0
#> 5  5         0         0         0         0         0        40         0

data(dataC)
dim(dataC)
#> [1] 20  6
dataC[1:5, ]
#>   id v4       v1 v5 v2 v3
#> 1  1  1 1.653901  4  1 NA
#> 2  2  2 0.362706  5  2  2
#> 3  3  1 1.496269 NA  5  2
#> 4  4  1 1.755541  5  3  3
#> 5  5  1 1.035714  5  7 NA

Both the microbiome data dataM and the covariates data dataC contain 20 samples (i.e., 20 rows).

Next we analyze the data to test the association between microbiome and all three variables "v1", "v2" and "v3".

results <- MZILN(MicrobData = dataM,
                 CovData = dataC,
                 linkIDname = "id",
                 allCov = c("v1", "v2", "v3"),
                 refTaxa = c("rawCount11"))
#> There are41taxa without any sequencing reads and
#>         excluded from the analysis
#> Data dimensions (after removing missing data if any):
#> 13samples
#> 18OTU's or microbial taxa
#> 3covariates in the analysis
#> These are the covariates:
#> v1v2v3
#> 0binary covariates in the analysis
#> 54.27percent of microbiome sequencing reads are zero
#> start Original screen
#> OriginDataScreen parallel setup took2.47seconds
#> Loading required package: MASS
#> Loading required package: Matrix
#> 
#> Attaching package: 'expm'
#> The following object is masked from 'package:Matrix':
#> 
#>     expm
#> Original screen done and took0.0363333333333333minutes
#> Reference taxa are: rawCount11
#> 16parallel jobs are registered for bootstrp in Phase 2.
#> Estimation done for the 1 th reference taxon: rawCount11 and it took 0.07283333 minutes
#> The entire analysis took0.154833333333333minutes

In this example, we are interested in testing the associations with "v1", "v2" and "v3", which is why allCov=c("v1","v2","v3").

The final analysis results are stored in the list results$analysisResults$estByRefTaxaList$rawCount11$estByCovList:

results$analysisResults$estByRefTaxaList$rawCount11$estByCovList
#> $v2
#>              Beta.LPR LowB95%CI.LPR UpB95%CI.LPR
#> rawCount29 0.03583529  -0.002122168   0.07067514
#> rawCount42 0.02566816  -0.017277108   0.06864125
#> 
#> $v3
#>                 Beta.LPR LowB95%CI.LPR  UpB95%CI.LPR
#> rawCount6  -0.0034889572   -0.01529409  0.0054946220
#> rawCount29 -0.0041563989   -0.01549365  0.0046623846
#> rawCount32 -0.0006867174   -0.01186827  0.0106614073
#> rawCount42 -0.0089560683   -0.01974328 -0.0002497733
#> rawCount45 -0.0084721161   -0.01970520  0.0009583073
#> rawCount47 -0.0034332912   -0.01697852  0.0115896848

The analysis identified two taxa, "rawCount29" and "rawCount42", as associated with "v2", and six taxa as associated with "v3". The regression coefficients and their 95% confidence intervals are provided. These coefficients correspond to \(\alpha^k\) in the model equation and can be interpreted as the associations between the covariates and the log-ratio of each identified taxon over the reference taxon.
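The log-ratio over a reference taxon that these coefficients describe can be made concrete with a small sketch. It builds the response of the MZILN model from made-up counts and leaves it undefined whenever the taxon or the reference is zero, mirroring the conditioning \(\mathcal{Y}_i^k>0,\mathcal{Y}_i^{K+1}>0\) in the model equation. The count matrix and the name "refTaxon" are hypothetical, and this is not how MZILN() is implemented internally.

```r
# Hypothetical observed counts, 3 samples x (3 taxa + 1 reference); not package data.
counts <- matrix(c(12,  0, 30, 60,
                    0,  5, 18,  9,
                    7, 21,  0,  0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("sample", 1:3),
                                 c("taxonA", "taxonB", "taxonC", "refTaxon")))

ref  <- counts[, "refTaxon"]
taxa <- counts[, colnames(counts) != "refTaxon", drop = FALSE]

# Log-ratio of each taxon over the reference taxon (row i is divided by ref[i]).
log_ratio <- log(taxa / ref)

# The response is only defined when both counts are positive.
log_ratio[taxa == 0 | ref == 0] <- NA

print(round(log_ratio, 3))
```

Sample 3 contributes nothing here because its reference count is zero, which illustrates why the choice of reference taxon matters.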

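The interpretation is that every unit increase in "v2" is associated with an approximately 3.6% increase in the abundance ratio of "rawCount29" over "rawCount11" (\(e^{0.0358}\approx 1.036\)) and an approximately 2.6% increase in the abundance ratio of "rawCount42" over "rawCount11" (\(e^{0.0257}\approx 1.026\)).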

All the analyzed covariates are stored in the object covariatesData:

results$covariatesData
#>    id          v1 v2  v3
#> 2   2  0.36270596  2   2
#> 3   3  1.49626921  5   2
#> 4   4  1.75554095  3   3
#> 6   6  1.64525227  4   4
#> 8   8 -1.57781131 24  22
#> 9   9  2.22581203 55   5
#> 10 10  0.71642615 98  67
#> 12 12  2.12230160 98   3
#> 14 14  1.99387922 93   4
#> 16 16  0.05417617 83  34
#> 18 18 -0.43426021 73  67
#> 19 19  1.46579846 68 566
#> 20 20  1.89625949 63  34