The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Analysing sparse ecological percent cover data using gllvm

Pekka Korhonen

2024-11-26

This short document showcases how to use the gllvm package to analyse multivariate percent cover data. Namely, we show how to apply the hurdle beta GLLVM (with logistic link), as detailed in Korhonen et al. (2024), to analyse the kelp forest dataset from the Santa Barbara Coastal Long-Term Ecological Research project, available from https://doi.org/10.6073/pasta/0af1a5b0d9dde5b4e5915c0012ccf99c.

A multivariate percent cover dataset often comes in the form of a \(n \times m\) matrix \(\bf{Y}\), with \(n\) being the number of observational units/sites, and \(m\) being the number of species of plants, macroalgae, sessile invertebrates, et cetera. The response \(y_{ij}\) is then the recorded percentage of the relative area covered by species \(j\) on unit/site \(i\). Typically such datasets contain considerable proportion of zero observations, as a given obs. unit is often populated by only a small subset of the \(m\) species in total. More rarely, it might even be the case that one of the species covers some obs. unit completely.

Hurdle beta GLLVM

Traditionally, beta regression has been used to model responses that take the form of percentages. If \(y^*_{ij}\) is in the (open) interval \((0,1)\), and is distributed according to beta distribution, \(y^*_{ij} \sim \text{Beta}(\mu_{ij}, \phi_j)\), then it has the following density: \[ f_{\text{beta}}(y_{ij}^*; \mu_{ij}, \phi_j) = \frac{\Gamma(\phi_j)}{\Gamma(\mu_{ij}\phi_j)\Gamma(\phi_j-\phi_j\mu_{ij})} (y_{ij}^*)^{\mu_{ij}\phi_j-1}(1-y_{ij}^*)^{\phi_j(1-\mu_{ij}-1)}.\]

The mean parameter \(\mu_{ij}\) can be connected to a set of covariates and latent variables through some link function, by the equation \(g(\mu_{ij})=\eta_{ij} = \beta_{0j} + \boldsymbol{\beta}_j^\top\bf{x_i} + \boldsymbol{\lambda}_j^\top\bf{u_i}\).

However, such a model is ill-suited for most percent cover datasets, due to the fact that the beta distribution isn’t capable of handling zero (or \(100\%\)) responses. To make use of the beta regression model in such scenario, one needs to use some transformation in order to ‘push’ the responses away from the boundaries. This procedure might provide reasonable results when the numbers of zeros or ones in the data are relatively low. On the other hand, by transforming the zeros and ones, we might lose some important information.

A more sophisticated way of tackling such issue is by considering a zero-accommodating model, couple of which have been proposed recently. One such model is the hurdle beta model, which models the two (can be extended to three if \(100\%\) covers are recorded, see Korhonen et al. (2024)) classes the response \(y_{ij}\) can take, i.e., \(\{0\}\) and \((0,1)\), by separate processes. Namely, the zeros are assumed to be generated by a Bernoulli process. Conditional on \(y_{ij}\in(0,1)\), the response is modeled using the standard beta distribution as presented above. The likelihood function for a response \(Y_{ij}\) following the hurdle beta distribution is of the form: \[ P(Y_{ij};\mu_{ij}, \mu_{ij}^0, \phi_j) = \begin{cases} 1-\mu_{ij}^0, & Y_{ij} = 0,\\ \mu_{ij}^0 \cdot f_{\text{beta}}(Y_{ij};\mu_{ij},\phi_j), & Y_{ij} \in (0,1). \\ \end{cases} \] where \(g(\mu_{ij}^0) = \eta_{ij}^0 = \beta_{0j}^0 + \bf{x_i}^\top\boldsymbol{\beta}_j^0 + \bf{u_i}^\top\boldsymbol{\lambda}_j^0\) for and \(g(\mu_{ij}) = \eta_{ij} = \beta_{0j} + \bf{x_i}^\top\boldsymbol{\beta}_j + \bf{u_i}^\top\boldsymbol{\lambda}_j\) and \(g(\cdot)\) can be either probit or logistic link function. Note, that here, the separate linear predictors share the same environmental covariates \(\bf{x}_i\) and latent variable scores \(\bf{u}_i\), while the coefficients and loadings are allowed to differ.

The gllvm package implements the hurdle beta GLLVM with two different estimation methods available. First, accessed by the argument method="VA" when calling uses a hybrid approach, where the method of variational approximations, or VA, is applied to the Bernoulli-process part of the data (only probit link allowed), while the method of extended variational approximations, see Korhonen et al. (2023), is applied to the beta distributed part. By instead specifying method="EVA", the EVA method is applied to both parts of the likelihood.

Example

In the following, we show how to fit the logistic hurdle beta GLLVM using EVA on the SBC LTER marine macroalgae (i.e., seaweed) percent cover dataset. The data has been collected in 2000-2020 along 44 permanent transect lines along coastal southern California. We will specify a model with two latent variables (for ordination) and will include the rockiness of the seabed and the average number of stripes of giant kelp as environmental covariates in the model.

library(gllvm)
data("kelpforest")
Yabund <- kelpforest$Y
Xenv <- kelpforest$X
SPinfo <- kelpforest$SPinfo

# Data contains both algae and sessile invertebrates
table(SPinfo$GROUP)
## 
##  ALGAE INVERT  PLANT 
##     61     69      2
# Select only the macroalgae:
Yalg <- Yabund[,SPinfo$GROUP=="ALGAE"]

# Remove empty rows and rows with missing values from the data
rem0 <- which((rowSums(Yalg)==0) | is.na(rowSums(Yalg)))
Yalg <- Yalg[-rem0,]
Xenv <- Xenv[-rem0,]

# Use only the data from the year 2016:

Yalg <- Yalg[Xenv$YEAR==2016,]
Xenv <- Xenv[Xenv$YEAR==2016,]

# Remove species which have no observations
Yalg <- Yalg[,-which(colSums(Yalg>0)==0)]
# Number of obs. and species:
dim(Yalg)
## [1] 44 53
# Specify the covariates in the linear predictor
Xformulai = ~KELP_FRONDS + PERCENT_ROCKY

After setting up the data, LV design and the covariates, the model is estimated by

fit <- gllvm(Yalg, X=Xenv, formula = Xformulai, family = "betaH", method="EVA", 
             num.lv = 2, link="logit", control=list(reltol=1e-12))

To inspect e.g., the covariate effects, use

fit$params$Xcoef
##            KELP_FRONDS PERCENT_ROCKY
## AMZO     -6.926576e-04 -3.419449e-02
## AU       -1.020450e-01 -3.089640e-03
## BF       -1.485643e+00  6.268714e-02
## BLD      -4.593091e-03 -6.824021e-02
## BO        1.082033e-01 -1.703388e-02
## BR        3.593815e-02  8.754444e-03
## BRA       3.249536e-01 -8.187708e-02
## CAL      -8.898427e-02  4.872812e-03
## CC        1.688338e-01 -9.612437e-03
## CF       -3.537180e-01  3.778452e-02
## CG       -1.431699e-02 -4.172088e-02
## CO       -7.171867e-02  9.536235e-03
## COF      -1.492480e-01  3.649886e-02
## CP       -3.490993e-01  3.336247e-02
## CRYP      1.105240e-03 -4.021097e-06
## CYOS      9.591773e-03 -6.121095e-03
## CZ       -1.743722e-03 -3.236607e-02
## DL        6.640460e-02 -1.284277e-03
## DMH      -3.610327e-02 -4.747740e-03
## DP       -6.418505e-02  7.576342e-03
## DU       -1.009267e+00 -2.734474e-03
## EAH       3.489164e+00 -3.518374e-02
## EC        9.778532e-02  1.264536e-02
## EH        1.425385e+00 -1.184152e-01
## ER        6.652890e-02  2.281598e-02
## FASP     -1.681229e-03 -5.414127e-02
## FB       -3.352533e-01  8.559165e-03
## FTHR     -1.988998e-02  1.674417e-03
## GR        5.405880e-01  3.356715e-02
## GS       -2.457032e-01 -3.474810e-03
## GYSP     -8.023519e-02 -4.342719e-03
## IR       -1.301316e-03 -4.854652e-02
## LH       -3.000163e-03 -4.123936e-02
## LI       -2.039607e-04 -1.835646e-01
## LS       -6.287617e-04 -8.840235e-03
## LX       -6.083141e-03 -8.163522e-02
## MH        1.070132e-01 -1.872839e-03
## NIE      -3.717803e-02 -8.403467e-03
## PH        1.422250e+00 -2.047847e-01
## PHSE     -2.939904e-01  2.578593e-03
## PL        9.841258e-02  7.086878e-04
## POLA     -3.288605e-01 -4.030514e-02
## PRSP      2.190012e-03 -1.579337e-04
## R         3.652964e-02  1.005004e-02
## RAT      -2.353364e-02  3.655290e-03
## SAFU     -8.732058e-01  1.686067e-02
## SAHO     -2.367829e+00 -1.298932e-02
## SAMU     -9.875292e-02 -1.795740e-04
## SCCA     -5.261532e-02  4.094153e-02
## STIN     -1.117553e-04 -1.005798e-01
## TALE      1.545333e-01 -2.454684e-02
## UEC      -1.858629e-05 -9.293147e-03
## UV        1.142739e+00  1.196097e-03
## H01_AMZO -1.077519e+00  1.182957e-01
## H01_AU   -1.610227e-01 -3.658912e-02
## H01_BF   -1.827179e-01 -6.135933e-02
## H01_BLD  -1.155947e+00  6.226792e-02
## H01_BO    1.842461e-01 -8.565025e-03
## H01_BR   -1.380823e-01  1.158724e-03
## H01_BRA   1.586778e-02 -6.345348e-02
## H01_CAL  -5.542341e-01  1.946985e-01
## H01_CC   -5.891700e-01  4.040126e-02
## H01_CF   -1.281117e-01  2.861065e-03
## H01_CG   -1.589743e-01  3.181475e-01
## H01_CO    6.063362e-02 -1.211045e-02
## H01_COF  -2.809342e+00  2.419088e-01
## H01_CP   -2.889048e-01  1.428933e-01
## H01_CRYP -8.746774e-02 -2.461221e-02
## H01_CYOS  1.252789e-01 -1.927851e-02
## H01_CZ   -4.066913e-01  2.326639e-01
## H01_DL    2.375553e-01 -4.956400e-02
## H01_DMH  -8.350503e-02 -5.084179e-02
## H01_DP   -1.038233e-01  1.648197e-02
## H01_DU   -5.556789e-01  2.218709e-01
## H01_EAH  -7.122910e+00  2.061907e-01
## H01_EC    7.850942e-02  6.847758e-02
## H01_EH    2.223421e-01 -6.291070e-02
## H01_ER   -2.246952e-02  3.549221e-02
## H01_FASP -1.789588e+00  2.173708e-01
## H01_FB   -9.105757e-01 -1.155929e-01
## H01_FTHR -1.482863e-01 -2.570203e-02
## H01_GR   -3.227311e-01  3.266667e-01
## H01_GS   -1.716337e+00 -1.706905e-01
## H01_GYSP -3.969697e-01  7.068655e-02
## H01_IR   -7.441518e-01  1.183389e-01
## H01_LH   -3.661752e-01  1.698636e-01
## H01_LI   -3.632474e+00 -2.412018e-02
## H01_LS   -4.700083e-02  1.300001e-01
## H01_LX   -5.695952e-01 -2.920796e-02
## H01_MH    2.851963e+00 -1.691651e-02
## H01_NIE  -3.796154e-02 -1.723835e-02
## H01_PH    8.275629e-01 -1.264264e-01
## H01_PHSE -1.926180e+00 -9.363851e-02
## H01_PL    7.445800e-02 -2.318431e-03
## H01_POLA -2.316144e-01 -2.666334e-02
## H01_PRSP -1.888217e-01 -6.964450e-05
## H01_R    -2.856247e-01  1.963927e-02
## H01_RAT  -1.874272e-01  3.119916e-02
## H01_SAFU -5.804313e+00 -2.524188e-01
## H01_SAHO -5.482167e+00  2.350453e-01
## H01_SAMU -7.142075e-02  2.244000e-03
## H01_SCCA -4.907491e-02  1.676260e-02
## H01_STIN -1.577110e+01 -6.620633e-02
## H01_TALE -3.913926e-01  2.607905e-01
## H01_UEC  -2.855413e+00  1.909605e-01
## H01_UV   -1.371414e+00 -4.323153e-02

In the above, the prefix indicates that the coefficient relates to the Bernoulli part of the hurdle model.

Ordination plot can then be generated as per usual:

ordiplot(fit, jitter = TRUE, s.cex = .8)

References

Korhonen, P., F.K.C. Hui, J. Niku, and S. Taskinen. 2023. “Fast and Universal Estimation of Latent Variable Models Using Extended Variational Approximations.” 33: 26. https://doi.org/10.1007/s11222-022-10189-w.
Korhonen, P., F.K.C. Hui, J. Niku, S. Taskinen, and B. van der Veen. 2024. “A Comparison of Joint Species Distribution Models for Percent Cover Data.” https://doi.org/10.48550/arXiv.2403.11562.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.